{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "apS7a7WpPARK"
      },
      "source": [
        "Install llama-cpp-python, remove cmake_args argument if not using GPU."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QXcUGm7SB9er"
      },
      "outputs": [],
      "source": [
        "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install llama-cpp-python # CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B8XfFxObOq9z"
      },
      "source": [
        "Download and extract ngrok. (following ngrok methods are only required when attempting to access from externally hosted notebooks)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "L8gOGVWUGfPN"
      },
      "outputs": [],
      "source": [
        "!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip\n",
        "!unzip ngrok-stable-linux-amd64.zip"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8sAsirMoNs6x"
      },
      "source": [
        "Replace \"authentication token\" with your authentication token from ngrok."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8bEYZnntpic7"
      },
      "outputs": [],
      "source": [
        "!./ngrok authtoken \"authentication token\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9oz_2m9fNa8I"
      },
      "source": [
        "Set up ngrok to listen on port 8000. Wait 2s to ensure ngrok loaded. Print public URL."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mzSCUjTfGulo"
      },
      "outputs": [],
      "source": [
        "get_ipython().system_raw('./ngrok http 8000&')\n",
        "!sleep 2\n",
        "!curl -s http://localhost:4040/api/tunnels | python3 -c \\\n",
        "    \"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CbvarINgPHX9"
      },
      "source": [
        "Install llama-cpp-python server.\n",
        "\n",
        "Run model on server.\n",
        "\n",
        "I use google drive to host my models for retrieval when using methods like colab for notebooks, other methods exist.\n",
        "\n",
        "If using an instance without GPU compute, remove --n_gpu_layers N_GPU_LAYERS.\n",
        "\n",
        "The more layers you can load to GPU, the more tokens/s you'll achieve.\n",
        "\n",
        "Larger contexts use more memory. Typical max context is 4096 for LLama2 models.\n",
        "\n",
        "n_gpu_layers = 42 is suitable for a T4 GPU with 13b parameter, q5_k_m quantised (or smaller) models, when using a context length of 4096."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sg_wDjrqCijU"
      },
      "outputs": [],
      "source": [
        "!pip install llama-cpp-python[server]\n",
        "!python3 -m llama_cpp.server --model drive/MyDrive/models/MythoMax-L2-13b-q5_k_m.gguf --n_ctx 4096 --n_gpu_layers 42 --host 0.0.0.0"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
